Journal of Cheminformatics — Latest Matching Preprints

1

Fragment-Based Ligand Generation Guided By Geometric Deep Learning On Protein-Ligand Structure

Dror, R.; Powers, A.; Suriana, P.; Yu, H.

2022-03-19 bioengineering 10.1101/2022.03.17.484653 medRxiv

Top 0.1%

50.4%

Computationally aided design of novel molecules has the potential to accelerate drug discovery. Several recent generative models aimed to create new molecules for specific protein targets. However, a rate-limiting step in drug development is molecule optimization, which can take several years due to the challenge of optimizing multiple molecular properties at once. We developed a method to solve a specific molecular optimization problem in silico: expanding a small, fragment-like starting molecule bound to a protein pocket into a larger molecule that matches the physicochemical properties of known drugs. Using data-efficient E(3)-equivariant neural networks and a 3D atomic point cloud representation, our model learns how to attach new molecular fragments to a growing structure by recognizing realistic intermediates generated en route to a final ligand. This approach always generates chemically valid molecules and incorporates all relevant 3D spatial information from the protein pocket. This framework produces promising molecules as assessed by multiple properties that address binding affinity, ease of synthesis, and solubility. Overall, we demonstrate the feasibility of 3D molecular structure expansion conditioned on protein pockets while maintaining desirable drug-like physicochemical properties, and we developed a tool that could accelerate the work of medicinal chemists.

2

BBB-Nuke: Transport-Aware Prediction of Blood-Brain Barrier Penetration in Small Molecules

Abasciano, N.; Hadipour, H.; Poddar, A.; Rudrum, J.; Sobodu, T.

2026-07-14 bioengineering 10.64898/2026.07.13.738280 medRxiv

Top 0.1%

35.9%

Predicting blood-brain barrier (BBB) penetration remains a central challenge in CNS drug discovery. Existing computational models rely on physicochemical descriptors and are blind to active transport biology - the efflux pumps and carrier proteins that dominate drug exclusion at the BBB in vivo. We present BBB-Nuke, a modular prediction pipeline that integrates physicochemical scoring with explicit efflux transporter substrate modeling. The system computes ten molecular descriptors, predicts ionization state via a graph convolutional network, scores CNS-MPO desirability, and estimates substrate probability for seven efflux transporters (P-gp/MDR1, BCRP/ABCG2, MRP1, MRP2, MRP4, MATE1, OAT3) using Random Forest classifiers trained on curated ChEMBL bioactivity data. A gradient-boosted classifier trained on 67 features - ten physicochemical, seven efflux transporter probabilities, and fifty fingerprint-derived principal components - achieves an area under the receiver operating characteristic curve (AUROC) of 0.933 ± 0.006 under five-fold cross-validation on 9,262 labeled compounds, and 0.810 on a fully held-out benchmark of 470 clinically validated compounds. In head-to-head comparisons, BBB-Nuke outperforms CNS-MPO, LightBBB, ADMETlab 2.0, and BBB-Score on both cross-validation and external test sets. We apply the pipeline to screen over one billion commercially available compounds from the Enamine REAL library and PubChem, identifying enriched regions of BBB-penetrant chemical space and characterizing the structural features that distinguish permeable from excluded molecules. BBB-Nuke is freely available as a Python package, REST API, and Model Context Protocol server.

3

Do chemical language models provide a better compound representation?

Torrisi, M.; Asadollahi, S.; de la Vega de Leon, A.; Wang, K.; Copeland, W.

2023-11-10 bioinformatics 10.1101/2023.11.07.566025 medRxiv

Top 0.1%

27.1%

In recent years, several chemical language models have been developed, inspired by the success of protein language models and advancements in natural language processing. In this study, we explore whether pre-training a chemical language model on billion-scale compound datasets, such as Enamine and ZINC20, can lead to improved compound representation in the drug space. We compare the learned representations of these models with the de facto standard compound representation, and evaluate their potential application in drug discovery and development by benchmarking them on biophysics, physiology, and physical chemistry datasets. Our findings suggest that the conventional masked language modeling approach on these extensive pre-training datasets is insufficient to enhance compound representations. This highlights the need for additional physicochemical inductive bias in the modeling beyond scaling the dataset size.

4

ToxiVerse: A Public Platform for Chemical Toxicity Data Sharing and Customizable Predictive Modeling

Durai, P.; Russo, D. P.; Shen, Y.; Wang, T.; Chung, E.; Li, L.; Zhu, H.

2026-03-02 bioinformatics 10.64898/2026.02.26.708255 medRxiv

Top 0.1%

26.7%

Chemical toxicity assessment is critical for drug development and environmental safety. Computational models have emerged as a promising alternative to animal testing and now play a significant role in efficiently evaluating new chemicals. To address the urgent need for providing user-friendly machine learning tools in computational toxicology, we developed ToxiVerse, a public web-based platform. It provides curated toxicity datasets, automatic chemical bioprofiling, and a predictive modeling interface designed for researchers who lack programming expertise. The platform comprises three integrated modules: (i) the Bioprofiler module, which provides chemical descriptors by combining chemical-bioactivity data from PubChem assay with a machine learning-based data gap-filling procedure; (ii) the Database module, which hosts around 50,000 curated unique chemicals covering diverse toxicity endpoints; and (iii) the Cheminformatics module, which allows users to upload their own datasets, use datasets from ToxiVerse, or retrieve existing data from PubChem; perform chemical curation; and automatically generate Quantitative Structure-Activity Relationship (QSAR) models to predict chemicals of interest. ToxiVerse enables researchers to carry out bioprofiling, access curated toxicity datasets, and evaluate chemical toxicity through machine learning-based modeling and prediction. The platform is supported by sample files and a detailed tutorial, and it is freely accessible at www.toxiverse.com.

5

MolRep: A Deep Representation Learning Library for Molecular Property Prediction

Rao, J.; Zheng, S.; Song, Y.; Chen, J.; Li, C.; Xie, J.; Yang, H.; Chen, H.; Yang, Y.

2021-01-16 bioinformatics 10.1101/2021.01.13.426489 medRxiv

Top 0.1%

23.2%

Summary: Recently, novel representation learning algorithms have shown potential for predicting molecular properties. However, unified frameworks have not yet emerged for fairly measuring algorithmic progress, and experimental procedures of different representation models often lack rigor and are hardly reproducible. Herein, we have developed MolRep by unifying 16 state-of-the-art models across 4 popular molecular representations for application and comparison. Furthermore, we ran more than 12.5 million experiments to optimize hyperparameters for each method on 12 common benchmark data sets. As a result, CMPNN achieves the best results, ranking 1st in 5 out of 12 tasks with an average rank of 1.75. ECC performs well in classification tasks and MAT in regression (each ranked 1st in 3 tasks), with average ranks of 2.71 and 2.6, respectively. Availability: The source code is available at: https://github.com/biomed-AI/MolRep Supplementary information: Supplementary data are available online.

6

Model Choice Metrics to Optimize Profile-QSAR Performance

He, S.; Kim, S.; McLoughlin, K. S.; Ranganathan, H.; Shi, D.; Allen, J. E.

2022-08-25 bioinformatics 10.1101/2022.08.22.504151 medRxiv

Top 0.1%

23.0%

Background: Predicting molecular activity against protein targets is difficult because of the paucity of experimental data. Approaches like multitask modeling and collaborative filtering seek to improve model accuracy by leveraging results from multiple targets, but are limited because different compounds are measured with different assays, leading to sparse data matrices. Profile-QSAR (pQSAR) 2.0 addresses this problem by fitting a series of partial least squares models for each target, using as features the predictions from single-task models on the remaining targets. This method has been shown to produce better results than single-task and multitask models. However, the factors determining the success of pQSAR 2.0 have not yet been characterized. In this paper we examine the experimental conditions that lead to better pQSAR models. We limit the amount of data available to the method by retraining with decreasing amounts of data and explore the model's ability to generalize to compounds that have never been assayed. Finally, we look at the properties of training data needed to demonstrate pQSAR improvement. Results: We apply pQSAR 2.0 on a collection of GPCR and safety targets collected from Drug Target Commons, ExcapeDB, and ChEMBL. We found that pQSAR improved models on 34 of the 149 assays selected. In the other 115 assays, single-task random forests offered better performance. There are many factors that contribute to an increase in performance, but the main factor is compound assay coverage. The pQSAR model improves when more compounds are measured in multiple assays. Conclusion: It is necessary to consider the available data before applying pQSAR. Successful pQSAR models require a profile made of correlated targets that share compounds with other assays. This technique is best used when experimental data is available as random forest regressors often do not generalize well enough for virtual drug search applications.
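
The abstract above describes pQSAR 2.0 as using, for each target, the single-task predictions on the remaining targets as features. A minimal sketch of that feature layout follows; it covers only the assembly of the profile matrix, not the partial least squares fit, and the function name, target names, and values are hypothetical illustrations rather than anything taken from the preprint.

```python
# Sketch of the profile-feature layout only (hypothetical names and toy data).
def profile_features(single_task_preds, target):
    """Return, per compound, the predictions for every target except `target`.

    single_task_preds maps a target name to a list of predicted activities,
    one per compound, with all lists in the same compound order.
    """
    others = sorted(t for t in single_task_preds if t != target)
    n_compounds = len(single_task_preds[target])
    rows = [[single_task_preds[t][i] for t in others] for i in range(n_compounds)]
    return others, rows


# Toy single-task predictions for three compounds on three assays.
preds = {
    "GPCR_A": [6.1, 5.2, 7.0],
    "GPCR_B": [5.9, 4.8, 7.3],
    "hERG": [4.0, 6.5, 5.1],
}
columns, X = profile_features(preds, "GPCR_A")
```

Here `X` would be the input to whatever regression is fitted for `GPCR_A`; each row is one compound and each column is one of the other assays.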

7

Predicting small-molecule inhibition of protein complexes

Yaseen, A.; Roy, S.; Akhter, N.; Ben-Hur, A.; Minhas, F.

2024-08-23 bioinformatics 10.1101/2024.08.23.609286 medRxiv

Top 0.1%

22.7%

Motivation: Protein-Protein Interactions (PPIs) are crucial in biological processes and disease mechanisms, underscoring the importance of discovering PPI inhibitors in drug development. Machine learning can expedite this discovery process. Although machine learning techniques for predicting general compound inhibition are available, we are not aware of any that accurately forecast the inhibitory effect of a compound on a specific protein complex, utilizing inputs from both the compound and the protein complex. Methods: We present the first targeted machine learning-based predictor of small-molecule inhibition of protein complexes. Our proposed graph neural network integrates the structure of a protein complex, its protein-protein binding site or interface features, and a compound's SMILES representation to predict the potential of the given compound to inhibit the interaction between proteins in the given complex in a targeted manner. Results: Validated on the 2p2i-DB-v2 database, encompassing 714 inhibitors across 23 complexes with over 12,000 instances, our model achieves superior predictive accuracy (cross-validation AUC-ROC of 0.86), outperforming established kernel methods and pre-trained neural networks. We further tested the predictive performance of our model on two independent external datasets - one collected from recent publications and another consisting of putative inhibitors of the SARS-CoV-2-Spike and Human-ACE2 protein complex - with AUC-ROCs of 0.82 and 0.78, respectively. Our targeted predictor introduces a novel approach for PPI inhibitor discovery, laying foundational work for future advancements in addressing this complex and previously unexplored prediction challenge. Availability: Code/supplementary material available: https://github.com/adibayaseen/PPI-Inhibitors

8

Evaluation of search-enabled Pre-trained Large Language Models on retrieval tasks for the PubChem Database

Sze, A.; Hassoun, S.

2024-08-19 bioinformatics 10.1101/2024.08.15.608120 medRxiv

Top 0.1%

22.5%

Databases are indispensable in biological and biomedical research, hosting vast amounts of structured and unstructured data, facilitating the organization, retrieval, and analysis of complex data. Database access, however, remains a manual, tedious, and sometimes overwhelming task. We investigate in this study the current state of using pre-trained, search-enabled LLMs for data retrieval from biological databases. Equipped with internet search and code generation capabilities, LLMs promise to streamline database access through natural language, expedite search and knowledge retrieval, and provide coherent analytical summaries. As an example database, we focus on evaluating a current search-enabled LLM (GPT-4o) for retrieval from the PubChem database, a flagship, heavily used database that plays a critical role in biological and biomedical research. As PubChem is an open archival repository, it provides a well-documented programmatic interface that can be exploited through LLM code generation capabilities. We evaluate retrieval tasks for eight common PubChem access protocols that were previously documented. The tasks include identifying interacting genes and proteins, finding drug-like compounds based on structural similarity, retrieving bioactivity data, and locating stereoisomers and isotopomers. We develop a methodology for adapting the protocols into an LLM prompt, where we supplement the prompt with additional context through iterative prompt refinement as needed. To further evaluate the LLM capabilities, we instruct the LLM to perform the retrieval with and without using programmatic access. We compare the results (referred to as gold and silver answers) when using these retrieval modalities with two traditional retrieval baselines that include running the manual search steps for each reference protocol through the PubChem database web interface, and through the provided PUG (Power-User Gateway) programmatic access.
We quantitatively and qualitatively summarize our results, showing that generating programmatic access is more likely to yield the correct answers. We highlight the value and limitations of using current search-based LLMs for database retrieval. We also provide guidance for the future development that can improve the accuracy and reliability of search-based LLMs.

9

Leveraging Transfer Learning for Predicting Protein-Small Molecule Interactions

Wang, J.; Dokholyan, N. V.

2024-10-12 bioinformatics 10.1101/2024.10.08.617219 medRxiv

Top 0.1%

19.3%

A complex web of intermolecular interactions defines and regulates biological processes. Understanding this web has been particularly challenging because of the sheer number of actors in biological systems: ~10^4 proteins in a typical human cell offer a plausible 10^8 interactions. This number grows rapidly if we consider metabolites, drugs, nutrients, and other biological molecules. The relative strength of interactions also critically affects these biological processes. However, the small and often incomplete datasets (10^3-10^4 protein-ligand interactions) traditionally used for binding affinity predictions limit the ability to capture the full complexity of these interactions. To overcome this challenge, we developed Yuel 2, a novel neural network-based approach that leverages transfer learning to address the limitations of small datasets. Yuel 2 is pre-trained on a large-scale dataset to learn intricate structural features and then fine-tuned on specialized datasets like PDBbind to enhance the predictive accuracy and robustness. We show that Yuel 2 predicts multiple binding affinity metrics - Kd, Ki, and IC50 - between proteins and small molecules, offering a comprehensive representation of molecular interactions crucial for drug design and development.

10

BioPipelines: Accessible Computational Protein and Ligand Design for Chemical Biologists

Quargnali, G.; Rivera-Fuentes, P.

2026-03-13 bioinformatics 10.64898/2026.03.11.711024 medRxiv

Top 0.1%

19.0%

Deep learning methods for protein structure generation, sequence design, and structure and property prediction have created unprecedented opportunities for protein engineering and drug discovery. However, using these tools often requires navigating incompatible software environments, diverse input/output formats, and high-performance computing infrastructure, any of which may hinder adoption by primarily experimental chemical biology laboratories. Here we present BioPipelines, an open-source Python framework that allows researchers to define multi-step computational design workflows in a few lines of code. Additionally, its robust yet modular architecture provides a straightforward way to expand the toolkit with different functionalities, particularly by leveraging coding agents, with little effort. The framework currently integrates over 30 tools encompassing structure generation, sequence design, structure prediction, compound screening, and analysis. The same workflow code can be prototyped interactively in a Jupyter notebook and then submitted for production-scale runs without modification. We demonstrate applications in inverse folding, gene synthesis, de novo protein design, compound library screening, iterative binding site optimization, and fusion-protein linker optimization. We hope this framework will empower researchers, allowing them to focus on the scientific question rather than computational logistics. BioPipelines is available under the MIT license at https://github.com/locbp-uzh/biopipelines.

11

Interpreting biochemical text with language models: a machine learning framework for reaction extraction and cheminformatic validation

Lim, D.; Badrinarayanan, S.; Sterling, K. C.; Rajesh, G.; Mistry, E.; Yang, D.; Lee, M.; Hsu, K. B.; Manjrekar, M.; Areff, C.; Xie, P.; Kristanto, I. A.; Chandran, A.; Anderson, J. C.

2025-05-20 bioinformatics 10.1101/2025.05.15.654376 medRxiv

Top 0.1%

18.9%

Recent advancements in large language models (LLMs) offer new opportunities for automating the manual curation of biochemical reaction databases from scientific literature. In this study, we present an integrated pipeline that enhances LLM-based extraction of enzymatic reactions with machine learning and cheminformatics-informed validation. Using BRENDA-linked PubMed articles, we evaluate GPT-4's ability to extract reactions and infer missing chemical entities in textual descriptions of enzymatic reactions. Extracted reactions are converted to SMILES and InChI notations before being encoded into molecular fingerprint similarity scores and atom mapping metrics. These cheminformatics metrics are then used to train machine learning classifiers that validate GPT extractions. We employ a Positive-Unlabeled learning approach with synthetic invalid reactions to train various classifiers and assess model performances. The best classifier is then benchmarked on GPT extractions. Our findings show that GPT can accurately infer incomplete reactions and cheminformatics tools can serve as effective predictors of reaction validity. This work demonstrates a scalable framework for automated and reliable curation of enzymatic reaction databases, highlighting the potential of combining LLMs with cheminformatics and machine learning for reliable scientific knowledge extraction. Author Summary: Curating databases of biochemical reactions is a time-consuming and manual task, yet it plays a vital role in advancing research in biology and chemistry. Many scientific articles describe important enzymatic reactions, but often do so in incomplete ways, such as mentioning only the starting molecule or the enzyme and leaving out the rest. In this work, we explore how recent advancements in artificial intelligence, specifically large language models like GPT, can help extract such information automatically from scientific literature.
We show that these models can not only find reactions in text, but also infer missing parts of reactions based on the surrounding context. To make sure these inferred reactions are chemically plausible, we use computational chemistry tools that analyze the structure of the molecules involved. We then train a machine learning model to help us automatically detect which reactions are likely to be valid. This combination of tools offers a new way to speed up and improve how biochemical knowledge is extracted from the growing body of scientific literature. Our study suggests that this kind of automation could help scientists keep biological databases up to date and reduce the burden of manual data entry.

12

PanScreen: A Comprehensive Approach to Off-Target Liability Assessment

Sellner, M. S.; Lill, M. A.; Smiesko, M.

2023-11-17 bioinformatics 10.1101/2023.11.16.567496 medRxiv

Top 0.1%

18.8%

Drug development projects are becoming increasingly expensive while their success rate is stagnating. Safety issues attributed to off-target binding represent a major reason for the failure of new drugs. Besides desired on-target binding, small molecules may interact with off-targets, triggering adverse effects. Therefore, the development of novel, resource-efficient, and cost-effective methods for early recognition of such issues becomes vital. Here, we introduce PanScreen, an online platform for the automated assessment of off-target liabilities. PanScreen combines structure-based modeling techniques with state-of-the-art deep learning methods to not only predict accurate binding affinities but also give insight into potential modes of action. We show that the predictions are approaching experimental accuracy found in public datasets and that the same technology can also be used for other research areas, such as drug repurposing. Such fast and inexpensive methods allow researchers to test not only drug candidates, but all small molecules that might come into contact with a human organism for potential safety concerns very early in the development process. PanScreen is publicly available at www.panscreen.ch.

13

Cross-chemical and cross-species toxicity prediction: benchmarking and a novel 3D-structure-based deep learning model

Yuan, R.; Shaw, J.; Tang, H.; Ye, Y.

2025-11-26 bioinformatics 10.1101/2025.11.24.690199 medRxiv

Top 0.1%

18.7%

Prediction of a compound's toxicity is a key step toward realizing animal-free testing of chemical compounds. Recent advances have yielded significant progress in computational toxicity prediction, including machine learning methods that utilize chemical fingerprints and deep learning-based latent representations. However, challenges remain, primarily due to the lack of clean training datasets and inconsistent model performance. To address these challenges, we curated a comprehensive dataset of aquatic toxicity from seven data sources, which contains 50,603 records for 5,889 compounds across 2,285 different species, much larger than similar datasets used in previous studies. We also developed tox-learn, a Python library featuring tools for automated dataset cleaning, machine learning methods, and performance evaluation. The library places special emphasis on avoiding overestimation of prediction accuracy caused by improper train-test data splitting. Based on this toolbox, we benchmarked various predictive models using different train-test splitting strategies on the curated dataset. Our results showed that the choice of machine learning method, molecular fingerprint, and train-test splitting strategy all significantly affect performance. We demonstrated that incorporating species information generally improved predictions, although the degree of improvement depended on how this information was represented. In addition, we developed a new 3D structure-based deep-learning model, 3DMol-Tox, which achieves regression accuracy comparable to the best 2D-structure based model (GPBoost) while exhibiting consistently higher within-one-bin (W1B) classification accuracy. Finally, we analyzed the impact of different train-test splitting strategies and provide recommendations based on our benchmarking, such as using structure-aware splitting to mitigate information leakage, a common issue that inflates reported model performance.
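
The structure-aware splitting recommended in the abstract above can be illustrated with a group-based split, in which compounds sharing a structural key never appear on both sides. The sketch below is an assumption-laden illustration, not the tox-learn API: `group_split` is a hypothetical helper, and the single-letter "scaffold" keys stand in for whatever structural grouping (for example a scaffold string) a real pipeline would compute.

```python
from collections import defaultdict


def group_split(records, key, test_fraction=0.2):
    """Assign whole groups to train or test so no group spans both sets."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)
    # Largest groups first, ties broken by key, so the result is deterministic.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), str(kv[0])))
    quota = test_fraction * len(records)
    train, test = [], []
    for _, members in ordered:
        # A group goes to the test set only if it fits within the quota whole.
        if len(test) + len(members) <= quota:
            test.extend(members)
        else:
            train.extend(members)
    return train, test


# Toy data: (compound id, placeholder scaffold key).
records = [(f"cpd{i}", scaffold) for i, scaffold in enumerate("AAAABBBCCD")]
train, test = group_split(records, key=lambda r: r[1], test_fraction=0.3)
```

Because groups are never divided, the test set can end up smaller than the requested fraction; that trade-off is the price of keeping close structural analogs out of both sets at once.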

14

AI-Guided Discovery of LDHA Inhibitors Targeting Cancer Metabolism Using Machine Learning and Generative Chemistry: An End-to-End Drug Discovery Pipeline

Petalcorin, M. F.; Petalcorin, M. I. R.

2025-05-16 biochemistry 10.1101/2025.05.13.653702 medRxiv

Top 0.1%

18.7%

Targeting cancer metabolism has emerged as a promising therapeutic strategy, particularly through the inhibition of Lactate Dehydrogenase A (LDHA), a key enzyme that supports the Warburg effect in tumor cells. In this study, we present a comprehensive and fully reproducible machine learning (ML) and artificial intelligence (AI)-driven pipeline for the discovery of small-molecule LDHA inhibitors. By integrating bioactivity datasets from ChEMBL and BindingDB, along with natural products from COCONUT and AI-generated compounds from a ChemGPT-based molecular language model, we constructed a diverse and chemically rich screening library. Molecular descriptors were computed using Mordred, followed by feature selection, dataset balancing using SMOTE, and extensive model benchmarking across 11 classifiers. LightGBM was selected as the top-performing model with an AUC of 0.97. SHAP analysis provided model interpretability, revealing key molecular features influencing LDHA inhibition. Additionally, we trained ChemGPT on LDHA-specific SMILES in SELFIES format to generate 1,000 novel molecules, of which over 100 passed stringent drug-likeness, toxicity, and solubility filters. A subset exhibited high LDHA inhibition probabilities (>0.90) and structural novelty. This work highlights the potential of combining predictive modeling and generative chemistry for accelerating the early stages of cancer drug discovery and provides an open-source platform for continued development and validation.

15

Reference-free compound identification using computational prediction of molecular properties and multi-dimensional spectrometric measurements: a fentanyl case study

Harrilal, C. P.; Hollerbach, A. L.; Ciesielski, D.; Schultz, K. J.; Overstreet, R.; Rice, P. S.; King, E.; Nguyen, J.; Ross, D. H.; Lin, V. S.; Deng, G. Y.; Brayfindley, E.; Webb-Robertson, B.-J.; Raugei, S.; Ibrahim, Y. M.; Ewing, R. G.; Metz, T.

2026-04-27 scientific communication and education 10.64898/2026.04.22.719980 medRxiv

Top 0.1%

18.7%

Mass spectrometry is used to identify chemicals to which humans are exposed, but it cannot directly determine molecular structures. Instead, structures are inferred by matching experimental spectra to libraries of spectra constructed from analyses of pure reference compounds. However, the chemical space of human exposures far exceeds the amount of experimental library spectra. Here, we evaluate a reference-free strategy for confident identification of unknown molecules. Using fentanyl as a case study, we created a suspect library of over 1 billion computationally predicted fentanyl analogs and predicted molecular properties through machine learning, molecular dynamics, and density functional theory. Multi-dimensional spectra from a blinded analysis of a mock fentanyl tablet were matched with the predicted library, yielding an average of three candidate structures per measured analog, with six exact identifications. This work emphasizes the promise of reference-free molecular measurements for assessing human exposure by merging computational predictions with high-dimensional measurements.

16

A Predictive Model for Compound-Protein Interactions Based on Concatenated Vectorization

Williams, G.; Azim, K.

2024-10-03 bioinformatics 10.1101/2024.10.02.616275 medRxiv

Top 0.1%

18.6%

Background: Large data sets of compound activity lend themselves to building predictive models based on compound and target structure. The simplest representation of structure is via vectorisation. Compound fingerprint vectorisation has been successfully employed in predicting compound activity classes. Results: A vector representation of a protein-compound pair based on a concatenation of the compound fingerprint and the protein triplet vector has been used to train random forest and neural network models on multiple datasets of protein-compound interaction together with compound-associated transcription and activity profiles. Results for compound-target predictability are comparable with more complex published methodologies. Conclusion: A simple intuitive representation of a protein-compound pair can be employed in a variety of machine learning models to gain a predictive handle on the activity of compounds for which there is no activity data. It is hoped that this transparent approach will prove sufficiently portable and simple to implement that drug discovery will be opened up to the wider research community.
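
The concatenated representation described in the abstract above can be sketched as follows. The abstract does not define the protein triplet vector, so this example assumes it is a count of overlapping amino-acid triplets over the 20 standard residues (8,000 dimensions); the function names and the toy 8-bit fingerprint are likewise hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed encoding: one dimension per ordered triplet of standard residues.
TRIPLET_INDEX = {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=3))}


def triplet_vector(sequence):
    """Count overlapping amino-acid triplets in a protein sequence."""
    vec = [0] * len(TRIPLET_INDEX)
    for i in range(len(sequence) - 2):
        idx = TRIPLET_INDEX.get(sequence[i:i + 3])
        if idx is not None:  # skip triplets containing non-standard residues
            vec[idx] += 1
    return vec


def pair_vector(fingerprint_bits, sequence):
    """Concatenate a compound fingerprint with the protein triplet vector."""
    return list(fingerprint_bits) + triplet_vector(sequence)


fp = [1, 0, 0, 1, 0, 1, 0, 0]  # toy fingerprint, not a real compound
x = pair_vector(fp, "MKVLAAGMKV")
```

The resulting vector `x` is a single flat feature row, which is what makes the representation usable with off-the-shelf random forest or neural network models.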

17

DiPPI: A curated dataset for drug-like molecules in protein-protein interfaces

Cankara, F.; Senyuz, S.; Sayin, A. Z.; Gursoy, A.; Keskin, O.

2023-08-14 bioinformatics 10.1101/2023.08.09.552637 medRxiv

Top 0.1%

18.4%

Proteins interact through their interfaces, and dysfunction of protein-protein interactions (PPIs) has been associated with various diseases. Therefore, investigating the properties of the drug-modulated PPIs and interface-targeting drugs is critical. Here, we present a curated large dataset for drug-like molecules in protein interfaces. We further present DiPPI (Drugs in Protein-Protein Interfaces), a two-module website to facilitate the search for such molecules and their properties by exploiting our dataset in drug repurposing studies. In the interface module of the website, we extracted several properties of interfaces, such as amino acid properties, hotspots, evolutionary conservation of drug-binding amino acids, and post-translational modifications of these residues. On the drug-like molecule side, we curated a list of drug-like small molecules and FDA-approved drugs from various databases and extracted those that bind to the interfaces. We further clustered the drugs based on their molecular fingerprints to confine the search for an alternative drug to a smaller space. Drug properties, including Lipinski's rules and various molecular descriptors, are also calculated and made available on the website to guide the selection of drug molecules. Our dataset contains 534,203 interfaces for 98,632 proteins, of which 55,135 are detected to bind to a drug-like molecule. 2,214 drug-like molecules are deposited on our website, among which 335 are FDA-approved. DiPPI provides users with an easy-to-follow scheme for drug repurposing studies through its well-curated and clustered interface and drug data, and is freely available at http://interactome.ku.edu.tr:8501.

18

Bioactivity assessment of natural compounds using machine learning models based on drug target similarity

Periwal, V.; Bassler, S.; Andrejev, S.; Gabrielli, N.; Typas, A.; Patil, K. R.

2020-11-08 bioinformatics 10.1101/2020.11.06.371112 medRxiv

Top 0.1%

18.4%

Natural compounds constitute a rich resource of potential small-molecule therapeutics. While experimental access to this resource is limited due to its vast diversity and difficulties in systematic purification, computational assessment of structural similarity with known therapeutic molecules offers a scalable approach. Here, we assessed functional similarity between natural compounds and approved drugs by combining multiple chemical similarity metrics and physicochemical properties through a random forest model. As a training set, we used pair-wise similarity between 1410 drugs in terms of their shared protein targets. The resulting model featured high performance metrics (Matthews correlation coefficient of 0.81 and balanced accuracy of 0.91), suggesting that it captured the structure-activity relation well. The model was then used to predict protein targets of circa 11k natural compounds by comparing them with the drugs. This revealed therapeutic potential of several natural compounds, including those with support from previously published sources as well as those hitherto unexplored. We experimentally validated one of the predicted activities, viz., Cox-1 inhibition by 5-methoxysalicylic acid, a molecule commonly found in tea, herbs and spices. In contrast, another natural compound, 4-isopropylbenzoic acid, which showed a higher similarity when considering the most weighted similarity metric but was not picked by the random forest model, did not inhibit Cox-1. Our results demonstrate the utility of a machine-learning approach combining multiple chemical features for uncovering protein binding potential of natural compounds.
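
For readers unfamiliar with the two metrics quoted in the abstract above, the following sketch computes the Matthews correlation coefficient and balanced accuracy from confusion-matrix counts using their standard definitions. The counts in the example are illustrative only and are not taken from the preprint.

```python
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity (recall on positives) and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Illustrative counts: 90 true positives, 90 true negatives, 10 of each error.
example_mcc = mcc(tp=90, tn=90, fp=10, fn=10)
example_bacc = balanced_accuracy(tp=90, tn=90, fp=10, fn=10)
```

Both metrics are less sensitive to class imbalance than plain accuracy, which is one plausible reason for reporting them on a drug-pair similarity task where positives are likely rare.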

19

EnsDTI-kinase: Web-server for Predicting Kinase-Inhibitor Interactions with Ensemble Computational Methods and Its Applications

Lu, Y.; Lim, S.; Park, S.; Choi, M.; Cho, C.; Kang, S.; Kim, S.

2023-01-08 bioinformatics 10.1101/2023.01.06.523052 medRxiv

Top 0.1%

18.4%

Motivation: Kinase inhibitors are a major category of drugs. Experimental panel assay protocols are routinely used as a standard procedure to evaluate the efficiency and selectivity of a drug candidate against target kinases. However, current kinase panel assays are time-consuming and expensive. In addition, the panel assay protocols neither provide insights on binding sites nor allow experiments on mutated sequences or newly characterized kinases. Existing virtual screening or docking simulation technologies require extensive computational resources, so it is not practical to use them for a panel of kinases. With rapid advances in machine learning and deep learning technologies, a number of drug-target interaction (DTI) tools have been developed over the years. However, these methods are yet to achieve prediction accuracies at the level of practical use. In addition, the performance of current DTI tools varies significantly depending on the test set. In this case, an ensemble model can be used to improve and stabilize DTI prediction accuracies. Results: In this work, we propose an ensemble model, EnsDTI-kinase, that integrates eight existing machine learning and deep learning models into a unified model deployed as a web-server. Upon submission of a compound SMILES string, potential target kinases are automatically predicted and evaluated on the web-server. Importantly, EnsDTI-kinase is a computational platform where newly developed DTI tools can be easily incorporated without modifying core components so that its DTI prediction quality can improve over time.
In addition, our platform provides several functionalities for users to further investigate predicted DTIs: it allows confidence experiments by changing the amino acid (AA) at a specific position in a kinase sequence, named in silico mutagenesis, to investigate the effect of AA changes on binding affinity; and it predicts the sequence regions of a kinase where the query compound likely binds by sliding a mask along the sequence of selected kinases, so that confidence in the predicted binding sites can be evaluated. Our model was evaluated in three experimental settings using four independent datasets, and showed an accuracy of 0.82 compared to the average accuracy of 0.69 from five deep learning methods on the ChEMBL dataset. It achieved an average selectivity of 0.95 within kinase families such as TK, CAMK and STE. For 8 out of 17 recent drugs, our model successfully predicted their interactions with 404 proteins at an average accuracy of 0.82. Availability: http://biohealth.snu.ac.kr/software/ensdti Contact: sunkim.bioinfo@snu.ac.kr
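
The abstract does not specify how the eight models are combined or how in silico mutagenesis is scored. The sketch below is one plausible reading, under loudly stated assumptions: each model is a hypothetical callable returning a score in [0, 1], the ensemble is a plain average, and the effect of a mutation is the change in ensemble score after substituting one residue. The toy models are placeholders and have no chemical meaning.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical model interface: (SMILES, kinase sequence) -> interaction score in [0, 1].
Model = Callable[[str, str], float]


def ensemble_score(models: Sequence[Model], smiles: str, sequence: str) -> float:
    """Average the scores of all models (simple averaging is an assumption)."""
    return mean(model(smiles, sequence) for model in models)


def mutate(sequence: str, position: int, new_aa: str) -> str:
    """Return the sequence with the residue at a 0-based position replaced."""
    if not 0 <= position < len(sequence):
        raise IndexError("position outside sequence")
    return sequence[:position] + new_aa + sequence[position + 1:]


def mutation_effect(models: Sequence[Model], smiles: str, sequence: str,
                    position: int, new_aa: str) -> float:
    """Ensemble score of the mutant minus that of the wild type."""
    mutant = mutate(sequence, position, new_aa)
    return ensemble_score(models, smiles, mutant) - ensemble_score(models, smiles, sequence)


# Placeholder models for demonstration only; real DTI models would go here.
def toy_model_a(smiles: str, sequence: str) -> float:
    return sequence.count("K") / len(sequence)


def toy_model_b(smiles: str, sequence: str) -> float:
    return sequence.count("D") / len(sequence)


delta = mutation_effect([toy_model_a, toy_model_b], "CCO", "MKDK", 1, "A")
print(delta)
```

Because the models are passed in as a list of callables, adding a new predictor only means appending to that list, which mirrors the extensibility the abstract claims for the platform.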

20

CPSign - Conformal Prediction for Cheminformatics Modeling

McShane, S. A.; Norinder, U.; Alvarsson, J.; Ahlberg, E.; Carlsson, L.; Spjuth, O.

2023-11-22 bioinformatics 10.1101/2023.11.21.568108 medRxiv

Top 0.1%

18.4%

Conformal prediction has seen many applications in pharmaceutical science, as it can calibrate the outputs of machine learning models and produce valid prediction intervals. Here we present the open source software CPSign, a complete implementation of conformal prediction for cheminformatics modeling. CPSign implements inductive and transductive conformal prediction for classification and regression, and probabilistic prediction with the Venn-ABERS methodology. The main chemical representation is signatures, but other types of descriptors are also supported. The main modeling methodology is support vector machines (SVMs), but additional modeling methods are supported via an extension mechanism, e.g. DeepLearning4j models. We also describe features for visualizing results from conformal models, including calibration and efficiency plots, as well as features to publish predictive models as REST services. We compare CPSign against other common cheminformatics modeling approaches, including random forest and a directed message-passing neural network. The results show that CPSign produces robust predictive performance with comparable predictive efficiency, with superior runtime and lower hardware requirements compared to neural-network-based models. CPSign has been used in several studies and is in production use in multiple organizations. The ability to work directly with chemical input files and to perform descriptor calculation and modeling with SVM in the conformal prediction framework, in a single software package with a low footprint and fast execution time, makes CPSign a convenient yet flexible package for training, deploying, and predicting on chemical data.
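
To make the idea of inductive conformal classification concrete, here is a minimal generic sketch. It is not CPSign's API and makes several assumptions: the nonconformity score is one minus the predicted probability of a label, a single shared calibration set is used (no per-class calibration), and the calibration scores and probabilities are invented numbers.

```python
from typing import Dict, List, Set


def icp_p_value(calibration_scores: List[float], test_score: float) -> float:
    """Fraction of calibration examples at least as nonconforming as the test example."""
    at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (at_least + 1) / (len(calibration_scores) + 1)


def prediction_set(calibration_scores: List[float],
                   class_probabilities: Dict[str, float],
                   significance: float) -> Set[str]:
    """Keep every label whose p-value exceeds the chosen significance level."""
    labels = set()
    for label, probability in class_probabilities.items():
        score = 1.0 - probability  # assumed nonconformity measure
        if icp_p_value(calibration_scores, score) > significance:
            labels.add(label)
    return labels


# Made-up nonconformity scores from a held-out calibration set.
calibration = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.90]
# Made-up class probabilities from some underlying model for one test compound.
probabilities = {"active": 0.85, "inactive": 0.15}

region = prediction_set(calibration, probabilities, significance=0.25)
print(region)
```

Lowering the significance level makes the prediction sets larger and more conservative, while raising it makes them smaller. That trade-off is what the calibration and efficiency plots mentioned in the abstract visualize.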